Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns; while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively.
translated by 谷歌翻译
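The prune-and-polarize idea above can be illustrated with a toy numpy sketch: split an attention map into a small set of dense token columns kept whole and a fixed sparse pattern for the rest. The 90% sparsity target follows the abstract, but the column-mass scoring and always-kept diagonal are illustrative assumptions, not ViTCoD's exact algorithm.

```python
import numpy as np

def polarize_attention(attn, sparsity=0.9):
    """Split an attention map into a denser part (globally important token
    columns, kept whole) and a sparser fixed pattern (here: the diagonal).
    Scoring columns by total received attention is an illustrative choice."""
    n = attn.shape[1]
    n_dense = max(1, int(round(n * (1.0 - sparsity))))
    col_score = attn.sum(axis=0)                      # rank token columns
    dense_cols = np.argsort(col_score)[::-1][:n_dense]
    mask = np.zeros_like(attn, dtype=bool)
    mask[:, dense_cols] = True                        # dense columns
    mask[np.arange(n), np.arange(n)] = True           # fixed sparse pattern
    return attn * mask, mask

rng = np.random.default_rng(0)
attn = rng.random((16, 16))
pruned, mask = polarize_attention(attn, sparsity=0.9)
density = mask.mean()   # fraction of attention entries that survive
```

The fixed masks are what make the workloads regular: the accelerator can schedule the dense columns and the sparse pattern as two separate, statically known workloads.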
ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity, and their different regions can contain various levels of visual information, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary, while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement training method to skip unnecessary layers, heads, and tokens in an input-adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of our MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts, because MIA-Former's multi-grained dynamic control improves model diversity, similar to the effect of ensembles, thus increasing the difficulty of adversarial attacks against all of its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can effectively allocate computation budgets adaptive to the difficulty of input images while boosting robustness, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even a higher accuracy compared with SOTA dynamic transformer models.
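The input-adaptive skipping described above can be sketched in a few lines of numpy: a tiny gate scores each residual block for the current input, and blocks below a threshold are bypassed entirely. The sigmoid gate and fixed threshold are illustrative assumptions, not MIA-Former's trained policy network.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 16
blocks = [rng.standard_normal((d, d)) * 0.1 for _ in range(6)]
gate_w = rng.standard_normal((d, len(blocks)))   # toy stand-in for the low-cost gate

def adaptive_forward(x, threshold=0.5):
    """Run only the blocks the gate keeps for this particular input."""
    keep = 1.0 / (1.0 + np.exp(-(x @ gate_w))) > threshold  # one decision per block
    executed = 0
    for i, W in enumerate(blocks):
        if keep[i]:
            x = x + np.tanh(x @ W)   # residual block is executed
            executed += 1
        # else: the block is skipped via its residual path -> compute saved
    return x, executed

x = rng.standard_normal(d)
out, n_exec = adaptive_forward(x)
```

Because the gate depends on the input, easy inputs naturally execute fewer blocks, which is the source of the adaptive computation savings the abstract reports.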
Functions play an important role in mathematics and many branches of science. With the rapid development of computer technology, more and more studies on computational function analysis have been introduced over the years, e.g., the fast Fourier transform, the wavelet transform, and curve fitting. However, there are two main problems with these methods: 1) it is difficult to handle complex functions that mix stationary and non-stationary, periodic and non-periodic, and high-order and low-order behavior; 2) it is difficult to generalize the fitted functions from training data to test data. In this paper, a multiple-regression-based function fitting network that solves these two main problems is proposed as a predictable function fitting technique. The technique constructs a network comprising three main parts: 1) a stationary transform layer, 2) a feature encoding layer, and 3) a fine-tuning regression layer. The stationary transform layer recognizes the order of the input function data and transforms non-stationary functions into stationary ones. The feature encoding layer encodes the original sequential input data into new linear regression features that can capture both the structural and the temporal characteristics of the sequential data. The fine-tuning regression layer then fits the features to the target ahead values. With both linear and nonlinear regression layers, the fitting network delivers high-quality fitting results and generalizable predictions. Experiments on mathematical function examples and real-world function examples verify the effectiveness of the proposed technique.
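The three-stage pipeline above (stationary transform → feature encoding → regression) can be demonstrated on a toy series. Here first-order differencing stands in for the stationary transform, lag features for the encoding layer, and least squares for the regression head; these are deliberately simple substitutes for the paper's layers, not its actual network.

```python
import numpy as np

def make_lag_features(y, n_lags=3):
    """Encode a sequence as lagged-value features plus a one-step-ahead target."""
    X = np.stack([y[i:len(y) - n_lags + i] for i in range(n_lags)], axis=1)
    t = y[n_lags:]
    return X, t

t = np.arange(200, dtype=float)
y = 0.5 * t + np.sin(0.3 * t)        # non-stationary: linear trend + cycle
dy = np.diff(y)                      # "stationary transform": remove the trend
X, target = make_lag_features(dy, n_lags=3)

# "Regression layer": least squares on the encoded features (with a bias term).
F = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(F, target, rcond=None)
pred_diff = F @ w

# Undo the differencing to predict y itself one step ahead.
pred_y = y[3:-1] + pred_diff
err = np.max(np.abs(pred_y - y[4:]))
```

Because the differenced series is a sinusoid plus a constant, it satisfies an exact low-order linear recurrence, so the regression recovers it essentially to machine precision; fitting the raw trending series directly would not generalize the same way, which is the point of the stationary transform.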
Compared with traditional model-based fault detection and classification (FDC) methods, deep neural networks (DNNs) prove to be effective for aerospace sensor FDC problems. However, the time consumed in training is excessive for DNNs, and interpretability analysis for the FDC neural network is still underwhelming. In recent years, a concept known as imagefication-based intelligent FDC has been studied. This concept advocates stacking the sensor measurement data into an image format, and then transforming the sensor FDC problem into an anomalous-region detection problem on the stacked image, which may readily borrow recent advances from the machine vision field. Although promising results have been claimed in imagefication-based intelligent FDC studies, owing to the low dimensionality of the stacked image, small convolutional kernels and shallow DNN layers were used, which hinders the FDC performance. In this paper, we first propose a data augmentation method that inflates the stacked image to a larger size, commensurate with the VGG16 net developed in the machine vision field. The FDC neural network is then trained by fine-tuning VGG16 directly. To truncate and compress the FDC net size (and thus its running time), we perform pruning on the fine-tuned net. A class activation mapping (CAM) method is also adopted for the interpretability analysis of the FDC net, to verify its internal operations. Via data augmentation, fine-tuning of VGG16, and model pruning, the FDC net developed in this paper achieves an FDC accuracy of 98.90% across 4 aircraft at 5 flight conditions (running time 26 ms). The CAM results also verify the FDC net w.r.t. its internal operations.
In this paper, a novel data-driven approach named "augmented imagefication" is proposed for fault detection (FD) of aircraft air data sensors (ADS). Exemplified by the FD problem of aircraft air data sensors, an online FD scheme on an edge device based on a deep neural network (DNN) is developed. First, aircraft inertial reference unit measurements are adopted as equivalent inputs, which are scalable to different aircraft/flight cases. Data associated with 6 different aircraft/flight conditions are collected to provide diversity (scalability) in the training/testing database. Then, augmented imagefication is proposed for the DNN-based prediction of flight conditions. The raw data are reshaped as a grayscale image for convolutional operations, and the necessity of augmentation is analyzed and pointed out. Different kinds of augmentation methods, i.e., flip, repeat, tile, and their combinations, are discussed, and the results show that the repeat operation on both axes of the image matrix leads to the best performance of the DNN. The interpretability of the DNN is studied based on Grad-CAM, which provides a better understanding and further solidifies the robustness of the DNN. Next, the DNN model, VGG-16 with augmented imagefication data, is optimized for mobile hardware deployment. After pruning the DNN, a lightweight model (98.79% smaller than the original VGG-16) with high accuracy (slightly increased by 0.27%) and fast speed (time delay reduced by 87.54%) is obtained. TPE-based hyperparameter optimization of the DNN is also implemented, and the optimal combination of hyperparameters (learning rate 0.001, iteration epochs 600, and batch size 100, with the highest accuracy of 0.987) is determined. Finally, an online FD deployment based on the edge device Jetson Nano is developed, and real-time monitoring of the aircraft is achieved. We believe that this approach is instructive for addressing FD problems in other similar fields.
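The "repeat on both axes" augmentation described above can be sketched as follows. Whether "repeat" here means element-wise repetition or whole-image tiling is ambiguous from the abstract; this sketch tiles the whole stacked image along both axes up to VGG-16's 224x224 input, and the sensor/timestep dimensions are made-up examples.

```python
import numpy as np

def repeat_augment(stacked, out_h=224, out_w=224):
    """Inflate a small stacked-sensor grayscale image to a larger size by
    repeating it along both axes, then crop to the target input size."""
    reps_h = -(-out_h // stacked.shape[0])   # ceil division
    reps_w = -(-out_w // stacked.shape[1])
    big = np.tile(stacked, (reps_h, reps_w))
    return big[:out_h, :out_w]

# Hypothetical stacked image: 12 sensor channels x 30 time steps.
sensor_img = np.random.default_rng(1).random((12, 30))
aug = repeat_augment(sensor_img)
```

The augmented image is periodic in both directions, so an anomalous sensor segment appears many times, giving the larger convolutional kernels of VGG-16 enough spatial extent to detect it.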
Hierarchical text classification aims to leverage label hierarchy in multi-label text classification. Existing methods encode the label hierarchy in a global view, where it is treated as a static hierarchical structure containing all labels. Since the global hierarchy is static and irrelevant to text samples, these methods find it hard to exploit hierarchical information. In contrast to the global hierarchy, a local hierarchy is the structured label hierarchy corresponding to each text sample; it is dynamic and relevant to text samples, which is ignored in previous methods. To exploit global and local hierarchies, we propose Hierarchy-guided BERT with Global and Local hierarchies (HBGL), which utilizes the large-scale parameters and prior language knowledge of BERT to model both global and local hierarchies. Moreover, HBGL avoids the intentional fusion of semantic and hierarchical modules by directly modeling semantic and hierarchical information with BERT. Compared with the state-of-the-art method HGCLR, our method achieves significant improvements on three benchmark datasets.
Deploying various deep learning (DL) models efficiently has boosted the research on DL compilers. The difficulty of generating optimized tensor code drives DL compilers to adopt auto-tuning approaches, and the increasing demands require ever-higher auto-tuning efficiency and quality. Currently, DL compilers partition the input DL model into several subgraphs and leverage auto-tuning to find the optimal tensor codes of those subgraphs. However, existing auto-tuning approaches usually regard subgraphs as individuals and overlook the similarity among them, and therefore fail to exploit better tensor codes under a limited time budget. We propose FamilySeer, an auto-tuning framework for DL compilers that can generate better tensor codes even with a limited time budget. FamilySeer exploits the similarities and differences among subgraphs to organize them into subgraph families, where tuning one subgraph can also improve other subgraphs within the same family. The cost model of each family gets more purified training samples generated by the family and becomes more accurate, so that the costly measurements on real hardware can be replaced with lightweight estimation via the cost model. Our experiments show that FamilySeer can generate model codes more efficiently than state-of-the-art auto-tuning frameworks.
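The grouping-into-families step can be sketched as follows. Using the operator sequence (ignoring shapes) as the family signature is an illustrative assumption for the sketch, not FamilySeer's actual similarity metric, and the subgraph names and shapes are hypothetical.

```python
from collections import defaultdict

def family_signature(subgraph):
    """Toy signature: the operator sequence, ignoring tensor shapes, so that
    structurally identical subgraphs with different shapes share a family."""
    return tuple(op for op, _shape in subgraph)

def group_into_families(subgraphs):
    families = defaultdict(list)
    for name, graph in subgraphs.items():
        families[family_signature(graph)].append(name)
    return dict(families)

# Hypothetical subgraphs as (op, output_shape) sequences.
subgraphs = {
    "sg0": [("conv2d", (1, 64, 56, 56)), ("relu", (1, 64, 56, 56))],
    "sg1": [("conv2d", (1, 128, 28, 28)), ("relu", (1, 128, 28, 28))],
    "sg2": [("dense", (1, 1024)), ("relu", (1, 1024))],
}
families = group_into_families(subgraphs)
```

Once grouped, every measured sample from any family member can train that family's shared cost model, which is why tuning one subgraph transfers to its siblings.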
The flourishing of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity of software and hardware so as to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meantime, the Sunway many-core processor renders itself a competitive candidate for its attractive computational power on both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM, which extends the original TVM with ahead-of-time compilation support for architectures such as Sunway. In addition, we leverage the architectural features during compilation, such as core groups for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory for data locality, to generate efficient code for deep learning workloads on Sunway. The experimental results show that the code generated by swTVM achieves a 1.79x speedup on average across six representative benchmarks. This work is the first attempt, from a compiler perspective, to bridge the gap between deep learning and the Sunway processor, particularly in terms of productivity and efficiency. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
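The image-level photometric alignment idea above can be illustrated with a closed-form stand-in: shift a source-domain image's per-channel statistics toward target-domain statistics. The module in the paper is learned; this mean/std matching is only a simple way to realize the same kind of alignment, and the target statistics below are made-up numbers.

```python
import numpy as np

def photometric_align(src, tgt_mean, tgt_std):
    """Normalize a source image per channel, then rescale to the target
    domain's per-channel mean/std (a simple photometric alignment)."""
    src_mean = src.mean(axis=(0, 1))
    src_std = src.std(axis=(0, 1)) + 1e-8   # avoid division by zero
    return (src - src_mean) / src_std * tgt_std + tgt_mean

rng = np.random.default_rng(2)
src = rng.random((64, 64, 3))               # hypothetical source-domain image
aligned = photometric_align(src,
                            tgt_mean=np.array([0.4, 0.45, 0.5]),
                            tgt_std=np.array([0.2, 0.2, 0.25]))
```

After alignment, the low-level color statistics of source images match the target domain, which reduces the image-level domain shift the segmentation network has to absorb.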
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
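The style-aware adaptive feed-forward mechanism above can be sketched in numpy: a style code predicts per-channel scales that modulate the layer's weight matrix, so the same content tokens produce style-dependent outputs. The dimensions, the tanh scaling scheme, and the single-layer setup are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_style = 8, 4
W = rng.standard_normal((d_model, d_model))   # base feed-forward weight
A = rng.standard_normal((d_style, d_model))   # maps style code -> channel scales

def adaptive_ff(x, style_code):
    """Feed-forward layer whose weights are modulated by the style code."""
    scale = 1.0 + np.tanh(style_code @ A)      # per-output-channel scaling
    return x @ (W * scale)                     # style-modulated weights

x = rng.standard_normal((5, d_model))          # 5 content tokens
out_a = adaptive_ff(x, rng.standard_normal(d_style))
out_b = adaptive_ff(x, rng.standard_normal(d_style))
```

Because the modulation acts on the weights rather than the activations, a single decoder can render the same speech content in different reference styles just by swapping the style code.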